vision language model
VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions
Visual imitation learning (VIL) provides an efficient and intuitive strategy for robotic systems to acquire novel skills. Recent Vision Language Models (VLMs) have demonstrated remarkable vision and language reasoning capabilities for VIL tasks. Despite this progress, current VIL methods naively employ VLMs to learn high-level plans from human videos and rely on pre-defined motion primitives to execute the physical interactions, which remains a major bottleneck. In this work, we present VLMimic, a novel paradigm that harnesses VLMs to directly learn fine-grained actions from only a limited number of human videos. Specifically, VLMimic first grounds object-centric movements from human videos and learns skills as hierarchical constraint representations, so that fine-grained skills can be derived from limited human videos. These skills are refined and updated through an iterative comparison strategy, enabling efficient adaptation to unseen environments. Our extensive experiments show that VLMimic, using only 5 human videos, yields significant improvements of over 27% and 21% on RLBench and real-world manipulation tasks, respectively, and surpasses baselines by more than 37% on long-horizon tasks. Code and videos are available on our anonymous homepage.
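To make the idea of a hierarchical, object-centric constraint representation concrete, here is a minimal sketch; the class names, stage split, and distance thresholds are invented for illustration and are not taken from the VLMimic paper.

```python
# Hypothetical illustration of a hierarchical, object-centric constraint
# representation; all names and thresholds are invented for this sketch.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class GeometricConstraint:
    """Relation between two object-centric keypoints, e.g. 'gripper tip
    within 5 cm of the mug handle'."""
    source_keypoint: str
    target_keypoint: str
    max_distance_m: float

    def satisfied(self, keypoints: dict) -> bool:
        d = np.linalg.norm(keypoints[self.source_keypoint] - keypoints[self.target_keypoint])
        return d <= self.max_distance_m

@dataclass
class Stage:
    """One phase of a skill (approach, grasp, transport, ...)."""
    name: str
    constraints: list = field(default_factory=list)

@dataclass
class Skill:
    """A skill is an ordered list of stages; finer-grained constraints sit
    lower in the hierarchy."""
    name: str
    stages: list = field(default_factory=list)

# Example: a 'pick mug' skill hypothetically grounded from one demonstration.
pick_mug = Skill("pick_mug", [
    Stage("approach", [GeometricConstraint("gripper_tip", "mug_handle", 0.05)]),
    Stage("grasp", [GeometricConstraint("gripper_tip", "mug_handle", 0.01)]),
])
keypoints = {"gripper_tip": np.array([0.0, 0.0, 0.0]),
             "mug_handle": np.array([0.0, 0.0, 0.04])}
print(pick_mug.stages[0].constraints[0].satisfied(keypoints))  # True
```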
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Recent advancements in large vision language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large vision language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even the state-of-the-art models still struggle with this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.
VFM-VLM: Vision Foundation Model and Vision Language Model based Visual Comparison for 3D Pose Estimation
Sarowar, Md Selim, Kim, Sungho
Vision Foundation Models (VFMs) and Vision Language Models (VLMs) have revolutionized computer vision by providing rich semantic and geometric representations. This paper presents a comprehensive visual comparison between CLIP-based and DINOv2-based approaches for 3D pose estimation in hand-object grasping scenarios. We evaluate both models on the task of 6D object pose estimation and demonstrate their complementary strengths: CLIP excels in semantic understanding through language grounding, while DINOv2 provides superior dense geometric features. Through extensive experiments on benchmark datasets, we show that CLIP-based methods achieve better semantic consistency, while DINOv2-based approaches demonstrate competitive performance with enhanced geometric precision. Our analysis provides insights for selecting appropriate vision models for robotic manipulation, grasping, and picking applications.
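As a rough illustration of the two feature types being compared, the sketch below extracts CLIP image-text similarities and DINOv2 dense patch tokens for a single grasp image; the checkpoints, prompts, and file name are assumptions, not the paper's experimental setup.

```python
# Illustrative comparison of CLIP (semantic, language-grounded) features and
# DINOv2 (dense, geometric) features on one hand-object grasp image.
import torch
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

image = Image.open("grasp.jpg").convert("RGB")  # hypothetical input frame

# --- CLIP: image-text similarity as a proxy for semantic understanding ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
texts = ["a hand grasping a mug", "a hand grasping a drill", "an empty table"]
inputs = proc(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    sim = clip(**inputs).logits_per_image.softmax(dim=-1)
print("CLIP semantic scores:", dict(zip(texts, sim[0].tolist())))

# --- DINOv2: dense patch tokens usable for geometric correspondence ---
dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()
prep = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
with torch.no_grad():
    feats = dinov2.forward_features(prep(image).unsqueeze(0))
patch_tokens = feats["x_norm_patchtokens"]  # (1, 16*16, 384) dense descriptors
print("DINOv2 patch grid:", patch_tokens.shape)
```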
The promising potential of vision language models for the generation of textual weather forecasts
Steele, Edward C. C., Mane, Dinesh, Monti, Emilio, Orus, Luis, Chantrill-Cheyette, Rebecca, Couch, Matthew, Dale, Kirstine I., Eaton, Simon, Rangarajan, Govindarajan, Majlesi, Amir, Ramsdale, Steven, Sharpe, Michael, Smith, Craig, Smith, Jonathan, Yates, Rebecca, Ellis, Holly, Ewen, Charles
Despite the promising capability of multimodal foundation models, their application to the generation of meteorological products and services remains nascent. To accelerate aspiration and adoption, we explore the novel use of a vision language model for writing the iconic Shipping Forecast text directly from video-encoded gridded weather data. These early results demonstrate promising scalable technological opportunities for enhancing production efficiency and service innovation within the weather enterprise and beyond.
KV-Efficient VLA: A Method to Speed up Vision Language Models with RNN-Gated Chunked KV Cache
Xu, Wanshun, Zhuang, Long, Shan, Lianlei
Vision-Language-Action (VLA) models offer a unified framework for robotic perception and control, but their ability to scale to real-world, long-horizon tasks is limited by the high computational cost of attention and the large memory required to store key-value (KV) pairs during inference, particularly when historical image tokens are retained as context. Recent methods have focused on scaling backbone architectures to improve generalization, with less emphasis on the inference inefficiencies that matter for real-time use. In this work, we present KV-Efficient VLA, a model-agnostic memory compression approach that addresses these limitations by introducing a lightweight mechanism to selectively retain high-utility context. Our method partitions the KV cache into fixed-size chunks and employs a recurrent gating module to summarize and filter the historical context according to learned utility scores. This design aims to preserve recent fine-grained detail while aggressively pruning stale, low-relevance memory. In our experiments, the approach yields an average of 24.6% FLOPs savings, a 1.34x inference speedup, and a 1.87x reduction in KV memory. Our method integrates seamlessly into recent VLA stacks, enabling scalable inference without modifying downstream control logic.
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.51)
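The abstract above describes chunking the KV cache and gating chunks by learned utility; the following is a minimal sketch of that idea, with the chunk size, GRU-based gate, and keep-threshold chosen arbitrarily rather than taken from the paper.

```python
# Minimal sketch of a chunked KV cache with a recurrent gate that scores each
# chunk's utility and prunes low-scoring history. Chunk size, GRU width, and
# the keep-threshold are invented here; the paper's actual module may differ.
import torch
import torch.nn as nn

class GatedChunkedKVCache(nn.Module):
    def __init__(self, head_dim, chunk_size=16, hidden=64, keep_threshold=0.5):
        super().__init__()
        self.chunk_size = chunk_size
        self.keep_threshold = keep_threshold
        # Recurrent gate: summarizes each chunk and emits a utility score.
        self.gru = nn.GRU(input_size=2 * head_dim, hidden_size=hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, keys, values):
        # keys, values: (seq_len, head_dim) for a single head and sequence.
        seq_len, _ = keys.shape
        n_full = seq_len // self.chunk_size
        kept_k, kept_v = [], []
        state = None
        for i in range(n_full):
            sl = slice(i * self.chunk_size, (i + 1) * self.chunk_size)
            chunk = torch.cat([keys[sl], values[sl]], dim=-1).unsqueeze(0)  # (1, chunk, 2d)
            summary, state = self.gru(chunk, state)               # recurrent chunk summary
            utility = torch.sigmoid(self.score(summary[:, -1]))    # learned utility score
            if utility.item() >= self.keep_threshold:              # prune stale, low-utility chunks
                kept_k.append(keys[sl]); kept_v.append(values[sl])
        # Always retain the most recent partial chunk for fine-grained detail.
        tail = slice(n_full * self.chunk_size, seq_len)
        kept_k.append(keys[tail]); kept_v.append(values[tail])
        return torch.cat(kept_k), torch.cat(kept_v)

cache = GatedChunkedKVCache(head_dim=64)
k, v = torch.randn(100, 64), torch.randn(100, 64)
pruned_k, pruned_v = cache(k, v)
print(pruned_k.shape)  # at most (100, 64) after pruning
```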
Jailbreaking Large Vision Language Models in Intelligent Transportation Systems
Das, Badhan Chandra, Jawad, Md Tasnim, Mia, Md Jueal, Amini, M. Hadi, Wu, Yanzhao
Large Vision Language Models (LVLMs) demonstrate strong capabilities in multimodal reasoning and many real-world applications, such as visual question answering. However, LVLMs are highly vulnerable to jailbreaking attacks. This paper systematically analyzes the vulnerabilities of LVLMs integrated in Intelligent Transportation Systems (ITS) under carefully crafted jailbreaking attacks. First, we construct a dataset of harmful transportation-related queries, following OpenAI's prohibited categories, to which LVLMs should not respond. Second, we introduce a novel jailbreaking attack that exploits the vulnerabilities of LVLMs through image typography manipulation and multi-turn prompting. Third, we propose a multi-layered response filtering defense technique to prevent the model from generating inappropriate responses. We perform extensive experiments with the proposed attack and defense on state-of-the-art LVLMs, both open-source and closed-source. To evaluate the attack and the defense, we use GPT-4 as a judge to score the toxicity of the generated responses, complemented by manual verification. Further, we compare our jailbreaking method with existing techniques and highlight the severe security risks that image typography manipulation and multi-turn prompting pose to LVLMs integrated in ITS.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > United States > South Carolina (0.04)
- Information Technology > Security & Privacy (0.68)
- Transportation > Infrastructure & Services (0.61)
- Government > Military (0.46)
- Government > Regional Government (0.46)
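The paper's multi-layered response filter is not specified in the abstract; the sketch below shows one plausible layered shape (lexical blocklist, toxicity classifier, LLM-as-judge), with all terms, thresholds, and stub components invented for illustration.

```python
# Purely illustrative sketch of a multi-layered response filter; the layers,
# thresholds, and example blocklist are assumptions and do not reproduce the
# paper's defense. The classifier and judge are injected as callables so any
# backend (e.g. a hosted moderation API) can be plugged in.
from typing import Callable

BLOCKLIST = {"hotwire", "disable the brakes"}  # hypothetical transport-related terms

def layered_filter(
    response: str,
    toxicity_model: Callable[[str], float],  # returns a score in [0, 1]
    llm_judge: Callable[[str], bool],        # returns True if the text is safe
    toxicity_threshold: float = 0.5,
) -> str:
    # Layer 1: cheap lexical screen.
    lowered = response.lower()
    if any(term in lowered for term in BLOCKLIST):
        return "[blocked: matched prohibited content]"
    # Layer 2: learned toxicity classifier.
    if toxicity_model(response) >= toxicity_threshold:
        return "[blocked: classifier flagged the response]"
    # Layer 3: LLM-as-judge for borderline cases.
    if not llm_judge(response):
        return "[blocked: judge rejected the response]"
    return response

# Example with stub components standing in for real models.
safe = layered_filter(
    "Traffic light phases are coordinated by the signal controller.",
    toxicity_model=lambda text: 0.01,
    llm_judge=lambda text: True,
)
print(safe)
```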
Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
Abid, Abderrazek, Ho, Thanh-Cong, Karray, Fakhri
As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools in various healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in the difficulty of evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption dataset and propose comprehensive methods for evaluating VLMs on HAR. Through comparative experiments with state-of-the-art deep learning models, we find that VLMs achieve comparable accuracy and, in some cases, even surpass conventional approaches. This work contributes a strong benchmark and opens new possibilities for integrating VLMs into intelligent healthcare systems. Code and dataset are available at: https://github.com/gouga10/
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States (0.04)
- North America > Canada > Ontario > Waterloo Region > Waterloo (0.04)
- Health & Medicine > Consumer Health (0.50)
- Information Technology > Security & Privacy (0.46)
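One common way to evaluate non-deterministic captions, sketched below, is embedding similarity against reference activity descriptions; the encoder, threshold, and example texts are assumptions and do not reproduce the evaluation methods proposed in the paper.

```python
# Scoring free-form VLM activity captions against reference descriptions via
# sentence-embedding similarity. Model name and threshold are illustrative.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_matches(generated: str, reference: str, threshold: float = 0.6) -> bool:
    """Treat a non-deterministic caption as correct if it is semantically
    close to the reference description of the activity."""
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

reference = "An elderly person slowly sits down on a chair in the living room."
generated = "The subject lowers themselves onto a chair."
print(caption_matches(generated, reference))  # likely True for close paraphrases
```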
Trust in Vision-Language Models: Insights from a Participatory User Workshop
Chiatti, Agnese, Piccolo, Lara, Bernardini, Sara, Matteucci, Matteo, Schiaffonati, Viola
With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, used to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and participant-engagement strategies for the case of user-VLM interaction.
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
- North America > United States > New York (0.04)
- Europe > Italy > Lombardy > Milan (0.04)
- Health & Medicine (0.68)
- Education (0.46)
RadDiagSeg-M: A Vision Language Model for Joint Diagnosis and Multi-Target Segmentation in Radiology
Li, Chengrun, Royer, Corentin, Luo, Haozhe, Wittmann, Bastian, Li, Xia, Hamamci, Ibrahim, Er, Sezgin, Sekuboyina, Anjany, Menze, Bjoern
Most current medical vision language models struggle to jointly generate diagnostic text and pixel-level segmentation masks in response to complex visual questions. This represents a major limitation towards clinical application, as assistive systems that fail to provide both modalities simultaneously offer limited value to medical practitioners. To alleviate this limitation, we first introduce RadDiagSeg-D, a dataset combining abnormality detection, diagnosis, and multi-target segmentation into a unified and hierarchical task. RadDiagSeg-D covers multiple imaging modalities and is precisely designed to support the development of models that produce descriptive text and corresponding segmentation masks in tandem. Subsequently, we leverage the dataset to propose a novel vision-language model, RadDiagSeg-M, capable of joint abnormality detection, diagnosis, and flexible segmentation. RadDiagSeg-M provides highly informative and clinically useful outputs, effectively addressing the need to enrich contextual information for assistive diagnosis. Finally, we benchmark RadDiagSeg-M and showcase its strong performance across all components involved in the task of multi-target text-and-mask generation, establishing a robust and competitive baseline. Code for this work is published at: github.com/RadDiagSeg
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Switzerland > Bern > Bern (0.04)
- Africa > Senegal > Dakar Region > Dakar (0.04)
- Research Report (0.50)
- Workflow (0.46)
- Health & Medicine > Diagnostic Medicine > Imaging (0.86)
- Health & Medicine > Therapeutic Area > Infections and Infectious Diseases (0.68)
- Health & Medicine > Nuclear Medicine (0.67)
Med-VRAgent: A Framework for Medical Visual Reasoning-Enhanced Agents
Guo, Guangfu, Lu, Xiaoqian, Feng, Yue
Visual Language Models (VLMs) achieve promising results in medical reasoning but struggle with hallucinations, vague descriptions, inconsistent logic, and poor localization. To address this, we propose an agent framework named Medical Visual Reasoning Agent (Med-VRAgent). The approach is based on Visual Guidance and Self-Reward paradigms and Monte Carlo Tree Search (MCTS). By combining Visual Guidance with tree search, Med-VRAgent improves the medical visual reasoning capabilities of VLMs. We use the trajectories collected by Med-VRAgent as feedback to further improve performance by fine-tuning the VLMs with the proximal policy optimization (PPO) objective. Experiments on multiple medical VQA benchmarks demonstrate that our method outperforms existing approaches.
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
- North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
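To illustrate the kind of search the Med-VRAgent abstract describes, here is a generic MCTS skeleton over candidate reasoning steps guided by a self-reward score; the node structure, UCT constant, and the stubbed propose/reward callables are placeholders, not the paper's actual algorithm.

```python
# Generic MCTS skeleton over candidate reasoning steps with a self-reward
# signal; all names, constants, and the stubbed VLM calls are invented here.
import math, random
from dataclasses import dataclass, field

@dataclass
class Node:
    reasoning: str
    parent: "Node" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def uct(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)

def mcts(root, propose, self_reward, iterations=50, depth=3):
    for _ in range(iterations):
        # Selection: walk down by UCT until a leaf.
        node = root
        while node.children:
            node = max(node.children, key=uct)
        # Expansion: ask the (stubbed) VLM for candidate next reasoning steps.
        if len(node.reasoning.split(" -> ")) < depth:
            node.children = [Node(f"{node.reasoning} -> {s}", parent=node) for s in propose(node.reasoning)]
            node = random.choice(node.children)
        # Evaluation: self-reward scores the trajectory; backpropagate.
        reward = self_reward(node.reasoning)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits)

root = Node("findings: opacity in left lower lobe")
best = mcts(root,
            propose=lambda ctx: ["check lateral view", "compare with prior study"],
            self_reward=lambda traj: random.random())
print(best.reasoning)
```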